BeautifulSoup supports the most commonly used CSS selectors through its select() method, which takes a selector string and can be called on a Tag object or on the BeautifulSoup object itself. The HTML used throughout this article is the "Three Sisters" document from the official documentation:

```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""
```
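Since select() itself does not appear in the examples below, here is a minimal sketch of CSS-selector queries against the same document (the selector strings are chosen for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p></body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

title_tags = soup.select('title')         # by tag name
sisters = soup.select('.sister')          # by CSS class
link1 = soup.select('#link1')             # by id
story_links = soup.select('p.story > a')  # child combinator
lacie = soup.select('a[href="http://example.com/lacie"]')  # by attribute value
print(len(sisters), link1[0].get_text())  # → 3 Elsie
```

Note that select() always returns a list of matching tags, even when only one element can match (as with an id selector).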
See the example directly:

The code is as follows:

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# html_doc is the document shown above
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.p)
print(soup.a)
print(soup.find_all('a'))
```
First, finding tags.

(1) Find all <a> tags:

```python
>>> for x in soup.find_all('a'):
...     print(x)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```

(2) Find all <a> tags whose href attribute matches a regular expression (keyword arguments filter on attribute values):

```python
import re
for a in soup.find_all('a', href=re.compile('lacie')):
    print(a)
```
The name argument of find_all() can be a string, a regular expression, a list, a function, or True. When a function is passed, the tags for which it returns True are selected:

```python
soup.find_all('b')
# [<b>The Dormouse's story</b>]

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title">...</p>, <p class="story">...</p>, <p class="story">...</p>]
```

Any unrecognized keyword argument is turned into a filter on the tag's attributes.
BeautifulSoup is a tool for parsing crawled content, and its find and find_all methods are especially useful. After parsing, it builds a tree structure that exposes the page as something like the key-value pairs of JSON, which makes operating on the page's content much easier and more convenient. Installing the library takes little effort: with Python's pip available, just execute pip install beautifulsoup4 at the command prompt. First copy the sample document from the documentation, then the code, as follows:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)  # html_doc is the document shown at the top of the article

# Print the title string
# print(soup.title.string)
# The Dormouse's story

# Print the <p> tag in soup; only the first one found is returned
# print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# Print the class name of that <p> tag (note: the value is a list)
# print(soup.p['class'], type(soup.p['class']))
# ['title'] <type 'list'>

# Print the <a> tag in soup; again only the first one found
# print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Print all the <a> tags
# print(soup.find_all('a'))

# Print the <a> tag with id="link3"
# print(soup.find(id='link3'))
```
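As a quick sketch of that tree structure, attributes read like dictionary keys and child tags like nested fields; the snippet below uses a one-paragraph document for brevity:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', 'html.parser')

p = soup.p
print(p['class'])       # attribute access works like a dict lookup: ['title']
print(p.b.string)       # descend the tree by tag name
print(p.b.parent.name)  # and walk back up: 'p'
```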
Building from source prints output like:

```
running bdist_egg
running egg_info
writing requirements to beautifulsoup4.egg-info/requires.txt
...
```

Then test the installation:

```
[email protected]:~/soft/python-source/beautifulsoup4-4.4.1$ python
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>>
```

If importing the BeautifulSoup package into the Python environment succeeds as shown above, the installation worked.
The text argument accepts a string, a regular expression, a list, and True. See the example:

```python
soup.find_all(text="Elsie")
# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
```
```python
# Create a BeautifulSoup object
soup = BeautifulSoup(html)

# Or open a local HTML file to create the object
# soup = BeautifulSoup(open('index.html'))

# Pretty-print the contents of the soup object
print(soup.prettify())
```

The result is the "Three Sisters" document above, printed with one tag per line and indentation showing the nesting.
```python
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
```

The class attribute of a tag is a multi-valued attribute. When searching for tags by CSS class name, each CSS class name of the tag is matched individually:

```python
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]
css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
```

You can also search for the exact string value of the class attribute:

```python
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
```

If the order of the CSS class names does not match the actual value of class, the exact-string search finds nothing.
See the example directly:

The code is as follows:

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# html_doc is the "Three Sisters" document shown at the top of the article,
# ending with: and they lived at the bottom of a well.
soup = BeautifulSoup(html_doc)
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.p)
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))
print(soup.get_text())
```

The results are:

```
<title>The Dormouse's story</title>
title
The Dormouse's story
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
...
```
Obtain the attributes of a tag through the get() function:

```python
soup = BeautifulSoup(html, 'html.parser')
pid = soup.findAll('a', {'class': 'sister'})
for i in pid:
    print(i.get('href'))  # use get() on each item to obtain the tag's attribute value
```

```
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
```

The other tags are available the same way; such an expression outputs the first matching object in the document. The find and findAll functions are required if you want to search all matches.
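The distinction matters because the two calls fail differently on a miss; a small sketch (the two-link document here is invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1">Elsie</a><a id="link2">Lacie</a>', 'html.parser')

first = soup.find('a')        # only the first matching tag
every = soup.find_all('a')    # a list of every matching tag
missing = soup.find('div')    # no match: find returns None...
empty = soup.find_all('div')  # ...while find_all returns an empty list
print(first['id'], len(every), missing, empty)
```

So code that loops over results can call find_all() unconditionally, while code that takes one result should check find() for None before using it.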
A simple collector class used to accumulate parsed records and write them out:

```python
class Outputer():
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output(self):
        fout = open('output.html', 'w', encoding='utf-8')  # create the html file
        fout.write('<html>')
```
Additional explanation of BeautifulSoup as a web-page parser follows:
```python
import re
from bs4 import BeautifulSoup

# html_doc is the "Three Sisters" document shown at the top of the article
soup = BeautifulSoup(html_doc, 'html.parser')

print('Get all the links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the link to Lacie')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regular match')
link_node = soup.find('a', href=re.compile(r'ill'))  # the pattern here is illustrative
print(link_node.name, link_node['href'], link_node.get_text())
```

The results are as follows:

```
Get all the links
a http://example.com/elsie Elsie
a http://example.com/lacie Lacie
a http://example.com/tillie Tillie
Get the link to Lacie
a http://example.com/lacie Lacie
Regular match
a http://example.com/tillie Tillie
```
```python
del tag['class']
del tag['id']
tag
# <b>extremely bold</b>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
```
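Deletion is only one side of attribute manipulation; assignment works through the same dict-style interface. A minimal sketch, reusing the documentation's "extremely bold" tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

tag['class'] = 'verybold'  # overwrite an existing attribute
tag['id'] = 'first'        # add a new one
print(tag)

del tag['class']
del tag['id']
print(tag)               # <b>Extremely bold</b>
print(tag.get('class'))  # None rather than a KeyError
```

Using tag.get() instead of subscripting is the safe way to read an attribute that may have been deleted.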
You can also find DOM elements in various other ways, as in the following example.
1. Build a document

```python
from bs4 import BeautifulSoup

# html_doc is the "Three Sisters" document shown at the top of the article
soup = BeautifulSoup(html_doc, 'html.parser')
```
Example. HTML file: the "Three Sisters" document (html_doc) shown above.

Code:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
```

Next you can start using the various features. soup.x (where x is any tag name) returns the entire first tag of that name, including the tag's attributes and contents:

```python
soup.title
# <title>The Dormouse's story</title>
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.a  # note: only the first result is returned
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')  # find_all returns all matches
soup.find(id="link3")  # find can also search by attribute
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
```python
soup.find_all('div', class_='abc', string='Python')
# class is a Python keyword, so the class_ parameter is used to avoid the conflict
```

3. Accessing a node's information

```python
# Get the tag name of the found node
node.name
# Get the href attribute of a found <a> node
node['href']
# Get the link text of the found node
node.get_text()
```

A BeautifulSoup demo:

```python
# -*- coding: utf-8 -*-
import os
import re
from bs4 import BeautifulSoup

# html_doc is the "Three Sisters" document shown at the top of the article
print('Get all the <a> links:')
soup = BeautifulSoup(html_doc, 'html.parser')
```
```python
# -*- coding: utf-8 -*-
# Python 2.7
# Xiaodeng
# http://tieba.baidu.com/p/2460150866
# Tag operations
from bs4 import BeautifulSoup
import urllib.request
import re

# If starting from a URL, the page can be read like this:
# html_doc = "http://tieba.baidu.com/p/2460150866"
# req = urllib.request.Request(html_doc)
# webpage = urllib.request.urlopen(req)
# html = webpage.read()

html = """..."""  # the page source, elided in the original
soup = BeautifulSoup(html, 'html.parser')  # document object

# use re.compile to match the href addresses that need to be crawled
for k in soup.find_all('a', href=re.compile(r'http')):  # the pattern here is illustrative
    print(k)
```